Week1 - About the course: Introduction to Open Data Science

The course learning objectives are:

Link to my GitHub repository for this course.


Week2 - Regression and model validation

I prepared myself for this week's topic by completing the DataCamp exercise on “Regression and model validation”.

Data Wrangling

For the data wrangling part of this week's exercises I mostly used the methods learned from the DataCamp exercise. I learned the basics of data wrangling in R, and now know how to find more methods/tools and how to use them to wrangle data.

My source code used in the wrangling is here.

Basically, these are the main steps performed in the code:

  • The original dataset is read from the given web source.
  • The structure of the data is explored with str and dim.
  • learning2014 is created as a new empty data.frame to store the required variables.
  • The variables gender, age, attitude and points are copied as-is from the original dataset to learning2014.
  • The question variables from the original dataset are combined and stored as scaled variables in learning2014. The dplyr library is used here.
  • Observations with points < 1 are filtered out from learning2014.
  • The learning2014 data.frame is saved as a text file in the same format as the original dataset.
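The steps above can be sketched in R roughly as follows. This is a minimal, self-contained sketch: the toy data frame, its question-variable names and the two question groups are illustrative stand-ins for the real survey data read from the web source.

```r
library(dplyr)

# Toy stand-in for the original survey data (the real data is read from the
# given web source); question names and groupings here are illustrative only.
raw <- data.frame(gender = c("F", "M", "F"),
                  Age = c(21, 25, 30),
                  Attitude = c(30, 25, 35),
                  Points = c(22, 0, 25),
                  D03 = c(4, 3, 5), D06 = c(3, 4, 4),   # "deep" questions
                  SU02 = c(2, 3, 2), SU05 = c(3, 2, 4)) # "surface" questions

learning2014 <- raw %>%
  mutate(deep = rowMeans(select(., D03, D06)),    # combine question groups by averaging
         surf = rowMeans(select(., SU02, SU05))) %>%
  select(gender, age = Age, attitude = Attitude, points = Points, deep, surf) %>%
  filter(points >= 1)                             # drop observations with points < 1

# Save in the same tab-separated format as the original dataset
write.table(learning2014, "learning2014.txt", sep = "\t", row.names = FALSE)
```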

The dataset (learning2014.txt) that the code created can be found here.

Data Analysis

With the dim function we learn the dimensions of the input dataset: the function returns the number of rows and columns.

dim(learning2014)
## [1] 166   7

With the str function we learn about the structure of the input dataset: the number of observations (rows) and variables (columns). The output also lists the variables, the type of each variable vector and a few example values from each vector.

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...

I use ggpairs to show a detailed graphical overview of the input dataset, just like it was done in the DataCamp course for this week's exercises. ggpairs creates a matrix of plots from the given dataset. With the aesthetics mapping we instruct the colors to be based on the gender variable, and the transparency is controlled with the value given to the alpha parameter. The lower parameter controls the type of plots in the lower section of the plot matrix. All plots are between pairs of variables in the learning2014 dataset, each categorized by the gender variable. Above the diagonal of density plots are the correlation coefficients of the pairs, both as a whole and categorized by gender.

From the plots we can see that nearly 2/3 of the observations are from females and thus 1/3 from males. When we look at the single variables categorized by gender, we see that the variable attitude deviates the most between the sexes. Females also tend to be younger than males based on the variable age. There are also deviations between females and males in the combined question variables, where the surface question group deviates the most and strategy to some extent.

When we look at the correlation coefficients, we see that points and attitude have the strongest positive correlation (0.437) among the dataset variables, roughly equally high for both females and males. The deep and surf combined question variables have the strongest negative correlation (-0.324), and the disagreement is clearly stronger among males (-0.622) than among females (-0.087). Then there are several pairs with similar positive correlations, (age, strategic), (deep, attitude) and (points, strategic), and several pairs with similar negative correlations, (strategic, surface), (points, surface), (attitude, surface) and (age, surface).

ggpairs(learning2014, 
        mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20)))

With the summary command we can see statistics for each variable in the dataset learning2014. For the categorical variable gender we only get the counts of females and males. For each of the numerical variables we get the minimum, maximum, mean, median, and 1st and 3rd quartiles.

summary(learning2014)
##  gender       age           attitude         points           deep      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   : 7.00   Min.   :1.583  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   1st Qu.:3.333  
##          Median :22.00   Median :32.00   Median :23.00   Median :3.667  
##          Mean   :25.51   Mean   :31.43   Mean   :22.72   Mean   :3.680  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75   3rd Qu.:4.083  
##          Max.   :55.00   Max.   :50.00   Max.   :33.00   Max.   :4.917  
##       surf            stra      
##  Min.   :1.583   Min.   :1.250  
##  1st Qu.:2.417   1st Qu.:2.625  
##  Median :2.833   Median :3.188  
##  Mean   :2.787   Mean   :3.121  
##  3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.333   Max.   :5.000

From the summary below we can see the explanatory variables (attitude, age, stra) I used to build the model with the lm function. I reran the code several times with different combinations of variables, comparing the summaries of the resulting models to find the best-fitting variables. Residuals are the differences between the actual values of points and the predicted values of points. For most regressions the residuals should be normally distributed, and one indication of a good model is that they are centered close to 0; here the residual median is indeed close to zero (0.3303). There is a shorthand for the significance of each variable at the end of its line in the Coefficients section of the summary: the more stars at the end of the line, the more significant the variable, and a dot is still quite good. Here we have *** for attitude and . for both stra and age. The p-value gives the probability of seeing an effect this large if the variable were NOT relevant, so small values are good. As we can see, we have a very low value for attitude and still reasonably low values for both stra and age.

The R-squared value tells how close the data are to the fitted regression line. It indicates the percentage of the variability of the response data around its mean that the model explains: 0% when none and 100% when all of the variability is explained. Here we have about 21%. In general, the higher the R-squared, the better the model fits the data, but a low R-squared value is not always bad. For example, when trying to predict human behavior, R-squared values are typically lower than 50%, and here we have a model fitted to variables describing mostly (results of) human behavior. Also, if R-squared is low but the variables are statistically significant, important conclusions can still be drawn. And here we have a low R-squared with statistically significant variables.

## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     0.34808    0.05622   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

Model assumptions describe the data generating process: the better the assumptions fit reality, the better the model describes the phenomenon of interest. With linear regression, linearity is assumed: the target is modelled as a linear combination of the model parameters. It is usually also assumed that the errors are normally distributed, uncorrelated and of constant variance, and that the size of a given error does not depend on the explanatory variables.

With a QQ-plot of the residuals we can explore the assumption that the errors of the model are normally distributed. The 2nd plot is the QQ-plot of the model, and we can see that the majority of the residuals fit well to the line and thus to the normality assumption.

The constant variance assumption can be explored with a simple scatter plot of residuals versus model predictions. Any pattern in the scatter plot implies a problem with the assumption. Here the scatter plot is the 1st plot, and I must say it is a bit difficult to tell whether there is a pattern (problem), especially at the smallest and largest fitted values. In between, the residuals are well scattered.

Leverage measures how much impact a single observation has on the model, and a residuals vs. leverage plot helps identify observations with high impact. Here the last plot is the residuals vs. leverage plot. Again it is a little hard for me to tell, but my interpretation is that there is no single outstanding observation that would drag the model.
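The three diagnostic plots discussed above can be produced with R's plot method for lm objects. A self-contained sketch follows; simulated data stands in for learning2014, and the coefficients used in the simulation are illustrative only.

```r
# Simulated stand-in for learning2014 so the sketch runs on its own
set.seed(1)
d <- data.frame(attitude = rnorm(166, 31, 7),
                stra = rnorm(166, 3.1, 1),
                age = rnorm(166, 25, 8))
d$points <- 10 + 0.35 * d$attitude + d$stra - 0.09 * d$age + rnorm(166, 0, 5)

m <- lm(points ~ attitude + stra + age, data = d)

# which = 1: Residuals vs Fitted, 2: Normal Q-Q, 5: Residuals vs Leverage
par(mfrow = c(2, 2))
plot(m, which = c(1, 2, 5))
```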


Week3 - Logistic Regression

I prepared myself for this week's topic by completing the DataCamp exercise on “Logistic Regression”.

Data Wrangling

As I did last week, for the data wrangling part of this week's exercises I mostly used the methods learned from the DataCamp exercise. I learned about joining datasets and doing data mutations.

My source code used in the wrangling is here.

Basically, these are the main steps performed in the code:

  • The initial dataset is acquired and extracted manually from the given web source.
  • Two datasets are read in and combined into one, since they contain answers from the same students, by joining them on the variables: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery, internet.
  • The columns not used in the join are combined by taking the rounded average of numerical values and by selecting the first value of non-numeric values.
  • The alc_use variable was created to indicate alcohol use (average of daily and weekly consumption).
  • The high_use variable was created to indicate high use of alcohol (TRUE if alc_use > 2).
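The join and mutation steps above can be sketched with dplyr along these lines. This is a toy sketch: the two miniature data frames stand in for the real math and Portuguese datasets, and the join columns are abbreviated to three of the 13 listed above.

```r
library(dplyr)

# Toy stand-ins for the two student datasets; the real join uses the 13
# background variables listed above, abbreviated here to three.
join_cols <- c("school", "sex", "age")
math <- data.frame(school = "GP", sex = c("F", "M"), age = c(17, 16),
                   absences = c(4, 2), Dalc = c(1, 3), Walc = c(1, 4))
por  <- data.frame(school = "GP", sex = c("F", "M"), age = c(17, 16),
                   absences = c(6, 2), Dalc = c(1, 3), Walc = c(2, 4))

alc <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por")) %>%
  mutate(absences = round((absences.math + absences.por) / 2),  # rounded average
         Dalc = round((Dalc.math + Dalc.por) / 2),
         Walc = round((Walc.math + Walc.por) / 2),
         alc_use = (Dalc + Walc) / 2,    # average of daily and weekly consumption
         high_use = alc_use > 2)         # TRUE if alc_use > 2
```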

The dataset (alc.csv) that the code created can be found here.

Data Analysis

Step 1 - New markdown file

Step one was to create chapter3.Rmd to inject this content into the diary as a child script of index.Rmd.

Step 2 - Combined dataset and attributes

The combined dataset contains data from two Portuguese schools, originally gathered from school reports and questionnaires. The data was initially collected in two datasets, one regarding student performance in mathematics and one regarding student performance in Portuguese language. In addition to the section Data Wrangling, where the join criteria and the additionally created variables are discussed, you can find more information about the variables here.

Dimensions and the column (variable) names of the combined dataset:

dim(alc)
## [1] 382  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Column (variable) types of the combined dataset:

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

Step 3 - Variables to study in relations with alcohol consumption

My hypothesis to study is how time consumption is related to alcohol consumption, the idea being that if you spend your time on activities, you would not have time to consume alcohol, and vice versa. My chosen variables are, in a sense, time-consuming variables: freetime, goout, studytime and traveltime.

  • traveltime - Home to school travel time. My intuition is that this would be a factor decreasing alcohol consumption.
  • studytime - Weekly study time. My intuition is that this would be a factor decreasing alcohol consumption.
  • freetime - Free time after school. My intuition is that this would be a factor increasing alcohol consumption.
  • goout - Going out with friends. My intuition is that this would be a factor increasing alcohol consumption.

Step 4 - Numerical and graphical exploration on chosen variables

Here I use ggpairs, familiar from last week, to visualize the chosen variables together with the variable high_use. From the diagram we quickly get a good overview of the relations of the variables between themselves and with the variable high_use. By looking at the correlation coefficients, we can observe that there is not much correlation between the chosen variables. We could say that freetime and goout correlate mildly positively, meaning that you would go out with friends during your free time, which makes common sense. Also, traveltime and studytime correlate mildly negatively, as do freetime and studytime, meaning that travel time between school and home, like free time, is not used for studying. My interpretation is that the low correlation between the chosen variables could indicate that the variables do not interfere with each other and could each independently be a factor increasing or decreasing alcohol consumption. That can be seen as being in favor of my hypothesis.

From the density plots we can see that there are some deviations for each of the chosen variables when comparing high and low alcohol usage. The most distinctive difference is between goout and high_use, where we could say that those who go out more have a tendency toward high alcohol consumption. When comparing the freetime and high_use densities, one could argue that with less free time there are fewer observations of high alcohol consumption, and they grow with more free time. A similar argument can be made when comparing studytime and high_use, where there is less high alcohol consumption when more time is spent on studies. And when comparing traveltime and high_use, one could say that if you live close to school, you have more time to consume alcohol. In my interpretation these still support my hypothesis, but it is not so clear, because the densities between low/high alcohol consumption per chosen variable are very similar: there are 1/3 high-consuming versus 2/3 low-consuming observations, and the values of the attributes are discrete and few.

Looking at the box and bar plots of each chosen variable divided by alcohol consumption, we can see that traveltime and freetime are quite identically distributed in terms of both high and low alcohol consumption. When looking at high alcohol consumption with freetime, it seems that high alcohol consumption increases more as free time increases, whereas low alcohol consumption increases less.

With studytime and goout the distributions differ more between high and low alcohol consumption. High alcohol consumption increases more as study time decreases, compared to low alcohol consumption. High alcohol consumption also increases more as time going out with friends increases, compared to low alcohol consumption.

Step 5 - Statistical exploration with logistic regression

Here we use logistic regression to statistically explore the relationship between chosen variables and the binary high/low alcohol consumption.

## 
## Call:
## glm(formula = high_use ~ studytime + goout + traveltime + freetime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5805  -0.7670  -0.5270   0.8878   2.4724  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.9531     0.7013  -4.211 2.54e-05 ***
## studytime    -0.5612     0.1655  -3.391 0.000697 ***
## goout         0.7206     0.1212   5.947 2.73e-09 ***
## traveltime    0.3619     0.1752   2.066 0.038868 *  
## freetime      0.0921     0.1354   0.680 0.496389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 396.18  on 377  degrees of freedom
## AIC: 406.18
## 
## Number of Fisher Scoring iterations: 4

From the summary we can see the distribution of deviance residuals for the individual cases used in the model. The table of coefficients shows the coefficient and its standard error for each variable. A coefficient tells the change in the log-odds of the outcome for a one-unit increase in the predictor variable (e.g. a one-unit change in goout increases the log-odds of high_use by 0.7206). The p-values give the probability of seeing an effect this large if the variable were NOT relevant, so low values are good. From the significance indicators we can tell that studytime, goout and traveltime are statistically significant, but freetime is not.

From the odds ratios of the coefficients below, we can say how each of the chosen variables is associated with high alcohol consumption (with high_use defined as binary, TRUE/FALSE). Since both the odds ratio and the confidence interval of studytime are below 1, it is associated with low alcohol consumption. Both goout and traveltime are above 1, so they are associated with high alcohol consumption. Freetime, however, has a confidence interval that includes the value 1, meaning it is uncertain which way the variable leans.

##                    OR      2.5 %    97.5 %
## (Intercept) 0.0521788 0.01274364 0.2006731
## studytime   0.5705061 0.40821453 0.7826192
## goout       2.0556712 1.63108393 2.6258521
## traveltime  1.4359821 1.01912982 2.0303214
## freetime    1.0964726 0.84079237 1.4317362
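A table like the one above can be computed by exponentiating the model coefficients and their confidence intervals. A self-contained sketch follows; simulated data stands in for alc, so the numbers will differ from the table above.

```r
# Simulated stand-in for the alc data
set.seed(42)
d <- data.frame(studytime  = sample(1:4, 382, replace = TRUE),
                goout      = sample(1:5, 382, replace = TRUE),
                traveltime = sample(1:4, 382, replace = TRUE),
                freetime   = sample(1:5, 382, replace = TRUE))
d$high_use <- runif(382) < plogis(-3 - 0.56 * d$studytime + 0.72 * d$goout)

m <- glm(high_use ~ studytime + goout + traveltime + freetime,
         data = d, family = "binomial")

OR <- exp(coef(m))     # odds ratios from the log-odds coefficients
CI <- exp(confint(m))  # 95% confidence intervals on the odds ratio scale
cbind(OR, CI)
```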

Based on the summary and odds ratios, I conclude that studytime, goout and traveltime support my hypothesis, but surprisingly freetime does not. And since freetime is not statistically significant, I will drop that variable from the remaining parts of this exercise analysis.

Step 6 - Predictive power of the model

First we need to re-fit the model after dropping the statistically non-significant freetime variable.

## 
## Call:
## glm(formula = high_use ~ studytime + goout + traveltime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5318  -0.7679  -0.5214   0.8604   2.4526  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.6967     0.5878  -4.588 4.47e-06 ***
## studytime    -0.5716     0.1650  -3.464 0.000532 ***
## goout         0.7431     0.1172   6.343 2.25e-10 ***
## traveltime    0.3559     0.1746   2.039 0.041486 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 396.64  on 378  degrees of freedom
## AIC: 404.64
## 
## Number of Fisher Scoring iterations: 4

Then we use the model to predict probabilities for the data that was used to train the model. Based on the probabilities we make predictions for the training data observations such that a probability > 0.5 means high alcohol consumption, otherwise low. In the first table below we can see the counts of the true values of high_use against the predictions, and in the second table the proportions of the same. From these we can compute that the model gets it right (0.63874346 + 0.11780105 = 0.75654451) ~76% of the time, meaning a training error of ~24%. The model gets ~91% of the true low alcohol consumption observations correct, but only ~39% of the true high alcohol consumption observations correct.

##         prediction
## high_use FALSE TRUE Sum
##    FALSE   244   24 268
##    TRUE     69   45 114
##    Sum     313   69 382
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.63874346 0.06282723 0.70157068
##    TRUE  0.18062827 0.11780105 0.29842932
##    Sum   0.81937173 0.18062827 1.00000000
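Tables like those above can be produced by predicting probabilities, thresholding at 0.5 and cross-tabulating with margins. A self-contained sketch follows; simulated data stands in for alc, so the counts will differ from the tables above.

```r
# Simulated stand-in for the alc data
set.seed(7)
d <- data.frame(studytime  = sample(1:4, 382, replace = TRUE),
                goout      = sample(1:5, 382, replace = TRUE),
                traveltime = sample(1:4, 382, replace = TRUE))
d$high_use <- runif(382) < plogis(-2.7 - 0.57 * d$studytime + 0.74 * d$goout)

m <- glm(high_use ~ studytime + goout + traveltime, data = d, family = "binomial")
d$probability <- predict(m, type = "response")  # fitted probabilities
d$prediction  <- d$probability > 0.5            # threshold at 0.5

addmargins(table(high_use = d$high_use, prediction = d$prediction))             # counts
addmargins(prop.table(table(high_use = d$high_use, prediction = d$prediction))) # proportions
```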

As a performance comparison against a simple guessing strategy, with weights 2/3 for low alcohol consumption and 1/3 for high alcohol consumption, we can see that the model beats this guessing strategy. The results of the simple guessing strategy are below: it gets ~59% correct of the true totals, ~70% of the true low alcohol consumption and ~32% of the true high alcohol consumption. (Note: since the guessing strategy is based on a randomizer, the percentages may differ slightly from those written above depending on how many times this R Markdown code is run.)

##         prediction
## high_use   0   1 Sum
##    FALSE 187  81 268
##    TRUE   77  37 114
##    Sum   264 118 382
##         prediction
## high_use          0          1        Sum
##    FALSE 0.48952880 0.21204188 0.70157068
##    TRUE  0.20157068 0.09685864 0.29842932
##    Sum   0.69109948 0.30890052 1.00000000

Here is also a graphical visualization of the actual and predicted values.

Step 7 - 10-fold cross-validation BONUS

For the 10-fold cross-validation we need to define a loss function. I use the same definition of the loss function as was used in the DataCamp exercises. Here is the error rate of the model trained above, computed with the loss function.

## [1] 0.2434555

And here is the error rate of the trained model with 10-fold cross-validation. As you can see, the model slightly outperforms the model used in the DataCamp exercise.

## [1] 0.2539267
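The loss function and the cross-validation step can be sketched with cv.glm from the boot package. Simulated data again stands in for alc, so the error rates will differ from the numbers above.

```r
library(boot)

# Simulated stand-in for the alc data
set.seed(7)
d <- data.frame(studytime  = sample(1:4, 382, replace = TRUE),
                goout      = sample(1:5, 382, replace = TRUE),
                traveltime = sample(1:4, 382, replace = TRUE))
d$high_use <- runif(382) < plogis(-2.7 - 0.57 * d$studytime + 0.74 * d$goout)
m <- glm(high_use ~ studytime + goout + traveltime, data = d, family = "binomial")

# Loss: proportion of observations on the wrong side of the 0.5 threshold
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

loss_func(class = d$high_use, prob = predict(m, type = "response"))  # training error
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]                                                          # 10-fold CV error
```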

Step 8 - Cross-validation of performance of different logistic regression models SUPERBONUS

In this part of the exercise set I used only the variables that are of numeric type (type: int); there are 14 such variables. I use a for loop to add the predictors one by one and do the cross-validation step after each addition to store the error. Unfortunately my plotting skills are poor and I was not able to flip the value labels of the x-axis (I was able to flip the “curve” but decided not to, since it would make things more confusing), hence the errors are presented in the plot in order of increasing number of predictors.

##  [1] 0.2958115 0.2958115 0.2958115 0.2879581 0.2879581 0.2879581 0.2879581
##  [8] 0.2958115 0.2565445 0.2670157 0.2539267 0.2513089 0.2486911 0.2382199
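The loop described above can be sketched as follows: fit a model after each added predictor and record its 10-fold cross-validated error. Simulated predictors stand in for the 14 numeric variables of alc, so the errors will differ from those above.

```r
library(boot)

# Simulated stand-ins for the numeric predictors and the response
set.seed(7)
n <- 382
preds <- as.data.frame(replicate(5, sample(1:5, n, replace = TRUE)))
names(preds) <- paste0("x", 1:5)
d <- cbind(preds, high_use = runif(n) < plogis(-1 + 0.5 * preds$x1))

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

errors <- numeric(ncol(preds))
for (k in seq_along(preds)) {
  f <- reformulate(names(preds)[1:k], response = "high_use")  # add predictors 1..k
  m <- glm(f, data = d, family = "binomial")
  errors[k] <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)$delta[1]
}
errors  # CV error after each added predictor
```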

Week4 - Clustering and classification

Also for this week I prepared myself by completing the DataCamp exercise on “Clustering and classification”.

Data Analysis

Step 1 - New markdown file

Step one again was to create chapter4.Rmd to inject its content into the actual course diary as a child script of index.Rmd.

Step 2 - The Boston dataset from the MASS package

In this exercise we are using the Boston dataset from the MASS package for R. The dataset contains “Housing Values in Suburbs of Boston” according to the dataset description here, where you can also find details about the dataset columns.

The Boston dataset is of class data.frame. The dimensions of the dataset are 506 observations (rows) with 14 variables (columns).
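The class and dimensions above come from a few basic calls; MASS ships with R, so no extra installation is needed:

```r
library(MASS)   # the Boston dataset lives in the MASS package
data("Boston")
class(Boston)   # "data.frame"
dim(Boston)     # 506 rows, 14 columns
```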

Structure of the dataset is listed below.

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Step 3 - Graphical overview and the variable summaries of the Boston dataset

First, with a summary, together with the additional information available here, we can take a look at the different scales of the variables. There are various rates, ratios, proportions, averages, means, medians and a dummy (boolean indicator) used as values for the different variables.

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Second, we can look at the density plots of each variable. I was tempted to draw conclusions about relationships already based on these diagrams, but as we can see from the correlation matrix (coming up next), things are not so obvious.

Third, with a type of correlation matrix we can observe the relationships between the variables. I used the mixed version of corrplot, which was introduced in the DataCamp exercises. From it you can visually locate the variables with positive (→ 1) and negative (→ -1) relations in the top triangle and check their coefficients in the lower triangle. We can observe that the variable pair (rad, tax) has the highest (> 0.9) positive relationship. Also the pairs (indus, nox), (indus, tax), (nox, age) and (rm, medv) have high positive relationships (> 0.7). Of the high negative relationships (< -0.7) we can mention the pairs (indus, dis), (nox, dis), (age, dis) and (lstat, medv). Also noteworthy is that the variable chas does not seem to correlate positively or negatively with any other variable in the dataset.
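The coefficients behind the mixed corrplot come from the plain correlation matrix, which can be checked directly; rounding to two decimals follows the DataCamp style:

```r
library(MASS)
data("Boston")

cor_matrix <- round(cor(Boston), 2)  # pairwise correlation coefficients
cor_matrix["rad", "tax"]             # the strongest positive pair noted above
cor_matrix["lstat", "medv"]          # one of the strong negative pairs
# The plot itself would be drawn with e.g. corrplot::corrplot.mixed(cor(Boston))
```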

Step 4 - Standardizing the Boston dataset

After scaling, each variable has mean 0 (or very close) and variance 1, so the values are centered around 0 in units of standard deviation. This can be observed from the summary and from the density plots after it (compare with the similar density plots above).
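The summary below comes from standardizing the dataset; scale subtracts the column mean and divides by the column standard deviation:

```r
library(MASS)
data("Boston")

# Column-wise standardization: (x - mean(x)) / sd(x)
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled$crim)  # the mean is now 0
```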

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Then, based on the quantiles of the scaled crime rate variable crim, a categorical variable is created. The original variable crim is removed and the newly created variable crime is added to the dataset. Here is the structure of the modified dataset.
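A sketch of the categorization, assuming boston_scaled is the standardized data.frame from the previous step:

```r
# break points at the quantiles of the scaled crime rate
bins <- quantile(boston_scaled$crim)

# cut the continuous variable into four labeled categories
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# remove the original crim and add the new categorical crime
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled$crime <- crime
```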

## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...

The dataset is split in two, so that 80% of the observations go to the train set and 20% to the test set.
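The split can be sketched with a random sample of row indices:

```r
# choose 80% of the row indices at random, without replacement
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))

train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]
dim(train); dim(test)
```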

## [1] 404  14
## [1] 102  14

Step 5 - Fitting the LDA

Here the LDA model is fitted with the train set and the model summary is printed. We can see that LD1 covers about 0.95 of the between-group variance. Next the LDA is plotted as a biplot; I used just one discriminant dimension at first. There was something going on with the arrows: they did not work on the same scale as in the DataCamp exercise, and in order to get even some of the arrows visible I needed to scale them way more (the number of discriminants did not have any effect on this).

!!Tue 27th!! Found the issue: it was not the arrows but the wrong data being plotted :(

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2376238 0.2623762 0.2524752 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       0.93520882 -0.8907091 -0.11484506 -0.8461428  0.4427384
## med_low  -0.08995672 -0.2868547 -0.02626030 -0.5436521 -0.1974597
## med_high -0.38045185  0.1548430  0.17338039  0.3563715  0.1188784
## high     -0.48724019  1.0171096 -0.04073494  1.0719641 -0.4264826
##                 age        dis        rad        tax     ptratio
## low      -0.8504039  0.8282951 -0.6959029 -0.7237406 -0.44783279
## med_low  -0.3196882  0.3639736 -0.5488034 -0.4678598 -0.02228321
## med_high  0.3883868 -0.3655824 -0.3870520 -0.2863228 -0.29451506
## high      0.7836698 -0.8384347  1.6382099  1.5141140  0.78087177
##                black       lstat        medv
## low       0.37713478 -0.75166181  0.49724279
## med_low   0.30962322 -0.09439842 -0.04649268
## med_high  0.08005809 -0.02617694  0.19348023
## high     -0.69571287  0.87641107 -0.65573310
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.12516906  0.86966092 -0.94732196
## indus   -0.02273021 -0.17179618  0.18549215
## chas    -0.08212910 -0.07022140  0.07547652
## nox      0.42388257 -0.65600866 -1.21121635
## rm      -0.11735441 -0.02662857 -0.24806326
## age      0.30134507 -0.32963416 -0.16565107
## dis     -0.05014585 -0.37462365  0.32242714
## rad      2.98404790  1.05684898 -0.14161191
## tax      0.03103018 -0.26536574  0.64453024
## ptratio  0.11510171  0.05685743 -0.19787389
## black   -0.13853306  0.04292244  0.11162022
## lstat    0.19938790 -0.23811448  0.41562347
## medv     0.19314756 -0.46312460 -0.11590396
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9479 0.0384 0.0137
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
# fixed this part
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 3)

Step 6 - Predicting with the LDA

The LDA model fitted in the previous step is used to predict the crime class for the test data after the real class values have been removed from it. The predicted classes are then compared against the known true values in the table below. We can see that on the test data the class high is predicted perfectly (25/25). Classes low and med_low fare the worst: almost half of their observations (12/27 and 14/30 respectively) are predicted into other classes. Class med_high does somewhat better, with 14 out of 20 observations predicted correctly.
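The prediction and cross-tabulation can be sketched as follows, assuming lda.fit and test from the steps above:

```r
# keep the true classes aside and remove them from the test data
correct_classes <- test$crime
test_data <- dplyr::select(test, -crime)

# predict with the fitted model and cross-tabulate against the truth
lda.pred <- predict(lda.fit, newdata = test_data)
table(correct = correct_classes, predicted = lda.pred$class)
```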

##           predicted
## correct    low med_low med_high high
##   low       15      12        0    0
##   med_low    6      16        8    0
##   med_high   1       5       14    0
##   high       0       0        0   25

Step 7 - Clustering with K-means

First, the Boston dataset is reloaded and scaled again.

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...

Then the Euclidean distances between the scaled observations are computed.
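A sketch of the distance computation on the rescaled data:

```r
# pairwise Euclidean distances between the scaled observations
dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
```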

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

A simple initial run of the K-means algorithm with a randomly chosen number of cluster centers (5). The pairs plot is hard to read when all variables are compared with each other. Splitting the data for the pairs plot makes it more readable, but then you miss the comparisons between all variables.

To determine the optimal number of clusters, the total WCSS (within-cluster sum of squares) can be used for help. Here I use 14 as the maximum number of clusters, just based on the number of variables in the dataset, and quickly plot the tWCSS results. The optimal number of clusters is where the tWCSS drops dramatically, and here we get the same result (2 clusters) as in the DataCamp exercise.
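A sketch of the tWCSS search, assuming boston_scaled is the rescaled all-numeric data from the reload above:

```r
set.seed(123)  # k-means starts from random centers

# total within-cluster sum of squares for k = 1..14
k_max <- 14
twcss <- sapply(1:k_max,
                function(k) kmeans(boston_scaled, centers = k)$tot.withinss)

# the sharpest drop in the curve suggests the number of clusters
plot(1:k_max, twcss, type = "b", xlab = "k", ylab = "total WCSS")
```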

Running the K-means algorithm again, now with 2 clusters as per the optimal suggestion. Plotting the results with pairs to see all variables compared and split into the two clusters.

Step Bonus - Finding classes to train LDA with K-means

I found the instructions for this bonus exercise a little ambiguous, so I took some liberties since it is not clearly stated how some of the things should be done.

So, first I took the Boston dataset and scaled it. Did nothing to crim, no categorization.

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Then I ran k-means on the scaled data with 3 clusters specified and show the pairs plot. I took the cluster vector from the results and added it as a new column to the original scaled data.
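This step can be sketched as:

```r
set.seed(123)
km <- kmeans(boston_scaled, centers = 3)

# pairs plot colored by cluster assignment
pairs(boston_scaled, col = km$cluster)

# append the cluster vector as a new column
boston_scaled$cluster <- km$cluster
```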

## 'data.frame':    506 obs. of  15 variables:
##  $ crim   : num  -0.419 -0.417 -0.417 -0.416 -0.412 ...
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ cluster: int  1 1 1 1 1 1 1 1 1 1 ...

I did the same split into train and test sets as earlier, so that I can see how well the model predicts. Judging from the coefficients, the most influential linear separators seem to be medv and chas (at least on LD2).

## Call:
## lda(cluster ~ ., data = train)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.5148515 0.3391089 0.1460396 
## 
## Group means:
##         crim          zn      indus        chas         nox          rm
## 1 -0.3941546  0.31227411 -0.6370635 -0.27232907 -0.65912810 -0.01684603
## 2  0.7549052 -0.48724019  1.1552802 -0.09990132  1.11285090 -0.46105942
## 3 -0.3149865  0.02365179 -0.3280701  1.39593374 -0.09317372  1.15993858
##          age        dis        rad        tax     ptratio      black
## 1 -0.6086596  0.6311264 -0.6003372 -0.6010240 -0.07361837  0.3163746
## 2  0.8059442 -0.8487400  1.0652503  1.1624509  0.51852048 -0.5592621
## 3  0.2613110 -0.3401229 -0.3667601 -0.5552257 -0.97842899  0.2623650
##        lstat        medv
## 1 -0.3954831  0.09021249
## 2  0.9137863 -0.77094599
## 3 -0.6633176  1.38467393
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim     0.01117325  0.09031124
## zn       0.42825390 -0.07804598
## indus    1.27189292  0.21207572
## chas    -0.29464473  0.82381171
## nox      0.81644143 -0.31106867
## rm       0.14381601  0.56471935
## age     -0.11369692  0.51461099
## dis      0.12926754 -0.35989536
## rad      0.70042482  0.35646727
## tax      0.31049220 -0.22916058
## ptratio  0.06106314 -0.40680906
## black   -0.04908228 -0.05389697
## lstat    0.32333200  0.40960024
## medv    -0.16431904  0.87258292
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7556 0.2444

##        predicted
## correct  1  2  3
##       1 57  1  0
##       2  0 33  0
##       3  2  1  8

And we can see that the prediction is quite precise. But it is unclear what these 3 classes should represent with this data (whereas above we were predicting crime categories).

Data Wrangling

In this part we needed to read two datasets from the internet, rename the columns in both, create two new variables, combine the data into one dataset and store it. I had some problems when reading the stored data back: I got 164 observations instead of the 195 in the file. I did not see anything obviously wrong in the file, but after turning the quote argument back to its default value (TRUE) when saving the data, it could be read with the correct number of observations.

The code for the data can be found from here
And the data can be found from here


Week4 - Dimensionality reduction techniques

This week we took a look at dimensionality reduction methods, namely principal component analysis (PCA) and multiple correspondence analysis (MCA).

There was the usual DataCamp exercise as an introduction to dimensionality reduction techniques.

Rest of the exercises are presented in this chapter of the diary.

Data Wrangling

In the data wrangling part we used the combined dataset from last week's exercise and did some further manipulations to it.

The script for the data mutilation is here.

And the resulting data is here.

Data Analysis

Step 1 - Summary and graphical overview of the data

As I had previously stored my data file as .csv, I noticed that when reading the file with read.csv2 the numeric variables (including integers) are interpreted as factors. One then needs to convert those factors back to numeric form. The other way around is to use read.table, in which case such variables are "automagically" interpreted as numeric.
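A sketch of the two ways to read the file (the file name human.csv and the GNI conversion are illustrative assumptions):

```r
# read.csv2 expects sep = ";" and dec = ","; with a mismatching file the
# numeric columns come in as factors and need converting back, e.g.:
human <- read.csv2("human.csv", header = TRUE)
human$GNI <- as.numeric(as.character(human$GNI))

# read.table with matching arguments interprets the numerics directly
human <- read.table("human.csv", sep = ",", header = TRUE)
```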

First I just use read.table and show the structure of the data.

## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Next I visualize the data with ggpairs and corrplot.mixed. From the ggpairs plot we can observe the density of each variable. The correlation coefficients are in the upper part of this plot, although they are a little difficult to read. The scatter plots in the lower part show the pairwise relationships.

The corrplot.mixed shows the pairwise correlations of the variables nicely. The two most strongly positively correlated pairs are (Edu.Exp, Life.Exp) and (Mat.Mor, Ado.Birth). And the two most strongly negatively correlated pairs are (Life.Exp, Mat.Mor) and (Mat.Mor, Edu.Exp).

From the summary we can see that there are clear differences in the scales of the variables. We can also get an idea of the spread of each variable's values.

##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Step 2 - Principal Component Analysis on non-standardized data

First the PCA is performed on the non-standardized human data (summary of the non-standardized data above). The table below is the summary of the performed PCA, showing that the first principal component (PC1) captures almost all, ~99.99%, of the variance in the non-standardized data.
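The PCA on the raw data can be sketched as (assuming the dataset is named human):

```r
# without standardization the variable with the largest variance (GNI)
# dominates the first principal component
pca_human <- prcomp(human)
summary(pca_human)
```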

## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Step 3 - Principal Component Analysis on standardized data

Next the PCA is performed on the standardized human data (summary of the standardized data below).

##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

The summary of the PCA performed on the standardized data is shown in the table below. The first principal component (PC1) captures roughly 53% and the second principal component (PC2) roughly 16% of the variance in the data, so almost 70% of the variance is captured by the first two principal components. The results between the PCA on non-standardized and standardized data differ a lot because the PCA method assumes that variables with larger variance are more important, and hence the method is sensitive to the relative scaling of the original variables.

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000

From the biplot below we can see which variables correlate with each other and with which PC. Group 1, with Ado.Birth and Mat.Mor, points in one direction with a small angle between the variables, meaning they have a strong positive correlation. Group 2, with Parli.F and Labo.FM, points in another direction, its variables again having a strong positive correlation. Group 3, with GNI, Edu.Exp, Life.Exp and Edu2.FM, points in a third direction with small angles between the variables, meaning they too correlate strongly and positively. Group 2 is almost orthogonal to both Group 1 and Group 3, meaning the Group 2 variables hardly correlate with the variables of the two other groups. Groups 1 and 3 point in opposite directions, meaning the variables in those groups correlate negatively with each other. Also, the Group 2 variables are almost parallel with the second principal component, and Groups 1 and 3 with the first, meaning those groups contribute mostly to those PCs respectively.
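The standardized PCA and its biplot can be sketched as (the color and size choices are just illustrative):

```r
# standardize, redo the PCA and draw a biplot of the first two PCs
human_std <- scale(human)
pca_std <- prcomp(human_std)
biplot(pca_std, choices = 1:2, cex = c(0.5, 0.8),
       col = c("grey40", "deeppink2"))
```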

Step 4 - Personal interpretation of the PCA on human data

My initial interpretation/understanding of PCA follows from the course material. It is a dimension reduction method for high-dimensional data. It captures the most important dimensions of a high-dimensional dataset as new (principal) components. Each PC is a linear combination of the original dimensions: the first PC captures the largest share of the variance (information) in the data, and each consecutive PC captures the largest share of the remaining variance while being uncorrelated with the previous components. The biplot above and the table below show which original variables contribute to each of the two principal components: PC1 is contributed to by the variables in Groups 1 and 3, PC2 by the variables in Group 2.

##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476

Step 5 - Why MCA on tea data?

(Sorry, couldn’t resist a joke on YMCA.)

1st try

By loading the tea dataset from the FactoMineR package and observing it, we can see that the dataset holds 300 observations of 36 variables. 35 variables are factors and only one variable (age) is numerical. Age is also represented as the categorical variable age_Q in the dataset.

## [1] 300  36
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...

I removed the variable age from the dataset and used the rest of the variables for the visualization below.

## [1] 300  35

For the MCA I will consider only the variables with binary values, simply because when running the MCA with all variables the dimensions are not 1:1 with the variables → each extra category level adds its own dimension, resulting in 54 dimensions if all variables are included. I have no other criterion for what to choose or not to choose as variables; I just think it is simpler with binary-valued variables only.
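A sketch of selecting the binary variables and running the MCA (tea_sup is the name used in the output):

```r
library(FactoMineR)
data(tea)

# keep only the factors with exactly two levels
is_binary <- sapply(tea, function(v) is.factor(v) && nlevels(v) == 2)
tea_sup <- tea[, is_binary]

mca <- MCA(tea_sup, graph = FALSE)
summary(mca)
```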

## 
## Call:
## FactoMineR::MCA(X = tea_sup, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.093   0.070   0.061   0.059   0.054   0.050
## % of var.              9.329   6.992   6.105   5.871   5.356   5.050
## Cumulative % of var.   9.329  16.322  22.427  28.298  33.654  38.704
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.045   0.042   0.041   0.038   0.038   0.037
## % of var.              4.527   4.159   4.124   3.840   3.794   3.704
## Cumulative % of var.  43.231  47.390  51.514  55.354  59.148  62.853
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17  Dim.18
## Variance               0.036   0.033   0.032   0.030   0.029   0.027
## % of var.              3.587   3.253   3.156   2.964   2.892   2.690
## Cumulative % of var.  66.439  69.692  72.848  75.811  78.703  81.393
##                       Dim.19  Dim.20  Dim.21  Dim.22  Dim.23  Dim.24
## Variance               0.026   0.025   0.023   0.022   0.020   0.020
## % of var.              2.586   2.479   2.263   2.203   1.982   1.976
## Cumulative % of var.  83.979  86.459  88.722  90.925  92.906  94.882
##                       Dim.25  Dim.26  Dim.27
## Variance               0.019   0.017   0.015
## % of var.              1.851   1.739   1.527
## Cumulative % of var.  96.733  98.473 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             | -0.683  1.668  0.536 |  0.139  0.092  0.022 | -0.412
## 2             | -0.459  0.753  0.246 | -0.231  0.254  0.062 | -0.421
## 3             |  0.081  0.024  0.005 |  0.360  0.617  0.103 | -0.142
## 4             | -0.702  1.762  0.368 | -0.184  0.162  0.025 |  0.220
## 5             | -0.263  0.247  0.076 |  0.148  0.105  0.024 | -0.092
## 6             | -0.772  2.128  0.454 |  0.079  0.030  0.005 | -0.227
## 7             | -0.351  0.441  0.163 |  0.289  0.398  0.110 | -0.511
## 8             |  0.055  0.011  0.005 |  0.033  0.005  0.002 | -0.143
## 9             | -0.430  0.662  0.232 |  0.178  0.151  0.040 | -0.321
## 10            | -0.119  0.050  0.016 |  0.132  0.083  0.020 | -0.307
##                  ctr   cos2  
## 1              0.927  0.195 |
## 2              0.970  0.207 |
## 3              0.110  0.016 |
## 4              0.265  0.036 |
## 5              0.046  0.009 |
## 6              0.281  0.039 |
## 7              1.424  0.344 |
## 8              0.112  0.033 |
## 9              0.563  0.129 |
## 10             0.513  0.106 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## always        |   0.361   1.778   0.068   4.516 |  -0.095   0.164   0.005
## Not.always    |  -0.189   0.930   0.068  -4.516 |   0.050   0.086   0.005
## breakfast     |   0.062   0.072   0.004   1.025 |   0.075   0.143   0.005
## Not.breakfast |  -0.057   0.067   0.004  -1.025 |  -0.069   0.132   0.005
## dinner        |  -1.257   4.389   0.119  -5.962 |  -0.195   0.142   0.003
## Not.dinner    |   0.095   0.330   0.119   5.962 |   0.015   0.011   0.003
## evening       |   0.293   1.167   0.045   3.659 |   0.674   8.250   0.237
## Not.evening   |  -0.153   0.610   0.045  -3.659 |  -0.352   4.313   0.237
## lunch         |   0.451   1.183   0.035   3.232 |   0.807   5.060   0.112
## Not.lunch     |  -0.077   0.203   0.035  -3.232 |  -0.139   0.870   0.112
##                v.test     Dim.3     ctr    cos2  v.test  
## always         -1.188 |   0.281   1.645   0.041   3.514 |
## Not.always      1.188 |  -0.147   0.860   0.041  -3.514 |
## breakfast       1.246 |  -0.610  10.830   0.343 -10.132 |
## Not.breakfast  -1.246 |   0.563   9.997   0.343  10.132 |
## dinner         -0.927 |   0.798   2.706   0.048   3.787 |
## Not.dinner      0.927 |  -0.060   0.204   0.048  -3.787 |
## evening         8.421 |   0.321   2.147   0.054   4.015 |
## Not.evening    -8.421 |  -0.168   1.123   0.054  -4.015 |
## lunch           5.786 |  -0.258   0.593   0.011  -1.850 |
## Not.lunch      -5.786 |   0.044   0.102   0.011   1.850 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## always        | 0.068 0.005 0.041 |
## breakfast     | 0.004 0.005 0.343 |
## dinner        | 0.119 0.003 0.048 |
## evening       | 0.045 0.237 0.054 |
## lunch         | 0.035 0.112 0.011 |
## tea.time      | 0.280 0.058 0.026 |
## friends       | 0.197 0.097 0.048 |
## home          | 0.009 0.011 0.047 |
## pub           | 0.140 0.010 0.000 |
## resto         | 0.128 0.135 0.007 |

Selecting many variables (here 27) makes the MCA factor map less readable. It can be observed that many categories lie close to each other, but it is hard to tell which ones exactly.

2nd try

Here I use the original dataset again, this time using the parameters of the MCA call to denote the different types of variables. Variables 1-18 are treated as active, variable 19 (age) as a supplementary continuous variable, and variables 20-36 as supplementary categorical variables. Then I use plot.MCA to output several plots for the performed MCA.
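This can be sketched with the quanti.sup and quali.sup parameters of MCA:

```r
library(FactoMineR)
data(tea)

# variables 1-18 active, age (19) supplementary continuous,
# variables 20-36 supplementary categorical
mca2 <- MCA(tea, quanti.sup = 19, quali.sup = 20:36, graph = FALSE)

# separate plots: individuals, active categories, supplementary variables
plot.MCA(mca2, invisible = c("var", "quali.sup"))  # individuals only
plot.MCA(mca2, invisible = c("ind", "quali.sup"))  # active categories
plot.MCA(mca2, invisible = c("ind", "var"))        # supplementary variables
```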

In the first plot there are the individuals, and we can see there are no clear groupings here.

The second plot shows the active categorical variables of the MCA. It is hard to make judgements here.

The third plot shows the supplementary variables, which did not contribute to the MCA.